feat: `ngen_cal_model_observations` model plugin hook and default implementation #155

aaraney · 2024-08-05T16:42:57Z

ngen_cal_model_observations is called during each calibration iteration to provide truth / observation values in the form of a pandas.Series, indexed by time with a record every simulation_interval. The returned pandas.Series should be in units of cubic meters per second.

A default implementation is included in this PR that pulls data from the USGS NWIS IV data service using hydrotools.nwis_client. It is enabled by default and does not require modifying configuration files.

Hook signature:

class ModelHooks:
    @hookspec(firstresult=True)
    def ngen_cal_model_observations(
        self,
        id: str,
        start_time: datetime,
        end_time: datetime,
        simulation_interval: pd.Timedelta,
    ) -> pd.Series:
        """
        `id`: USGS gage station id
        `start_time`, `end_time`: inclusive simulation time range
        `simulation_interval`: time (distance) between simulation values
        """
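To make the return contract concrete, here is a minimal sketch of how a plugin author might check that a returned series covers every expected timestamp. The helper names `expected_index` and `missing_timestamps` are hypothetical and not part of ngen.cal; the sketch only assumes the contract described above (inclusive time range, one record per `simulation_interval`).

```python
from datetime import datetime

import pandas as pd


def expected_index(start_time, end_time, simulation_interval):
    """Timestamps the hook's result is expected to cover (both ends inclusive)."""
    return pd.date_range(start=start_time, end=end_time, freq=simulation_interval)


def missing_timestamps(obs, start_time, end_time, simulation_interval):
    """Return the expected timestamps that are absent from the observation index."""
    expected = expected_index(start_time, end_time, simulation_interval)
    return expected.difference(obs.index)


start = datetime(2024, 1, 1, 0)
end = datetime(2024, 1, 1, 3)
interval = pd.Timedelta(3600, unit="s")

# Made-up observations (m^3/s) with the 02:00 record missing.
obs = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 03:00"]
    ),
)
missing = missing_timestamps(obs, start, end, interval)
```

Here `missing` holds the single 02:00 timestamp, which is the kind of gap a file-based plugin may want to detect before handing data to the calibration loop.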

Users who would like to load observation data from a file, for example, should implement this plugin hook and specify it as a model plugin in their ngen.cal configuration file. For example:

# ngen_cal_observation_plugin.py
from __future__ import annotations

import pandas as pd
from ngen.cal import hookimpl

import typing
if typing.TYPE_CHECKING:
    from datetime import datetime

class NgenCalObservationPlugin:
    @hookimpl(trylast=True)
    def ngen_cal_model_observations(
        self,
        id: str,
        start_time: datetime,
        end_time: datetime,
        simulation_interval: pd.Timedelta,
    ) -> pd.Series:
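        # Assumed CSV layout: columns "sites", "time", and "value".
        # Read site ids as strings so leading zeros survive; parse "time" as datetimes.
        df = pd.read_csv(
            "local_observations.csv", dtype={"sites": str}, parse_dates=["time"]
        )

        assert id in df["sites"].values, f"{id} not in {df['sites'].values}"
        # Keep only the requested site.
        df = df[df["sites"] == id]

        # Keep only records within the inclusive simulation time range.
        df = df[(df["time"] >= start_time) & (df["time"] <= end_time)]

        df = df.set_index("time")

        # Align records to the simulation interval.
        ds = df["value"].resample(simulation_interval).nearest()
        return ds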
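The file-loading logic can be exercised outside of ngen.cal. The sketch below is a standalone version of the same steps run against an in-memory CSV; the function name `load_observations`, the column layout, and all data values are assumptions made for illustration. It also shows why the site column should be read as a string: pandas otherwise infers an integer and drops the leading zero of a gage id.

```python
from datetime import datetime
from io import StringIO

import pandas as pd

# Made-up data in the column layout the example plugin assumes.
CSV_TEXT = """sites,time,value
01646500,2024-01-01 00:00:00,1.5
01646500,2024-01-01 01:00:00,2.5
01646500,2024-01-01 02:00:00,3.5
02037500,2024-01-01 00:00:00,9.0
"""


def load_observations(source, id, start_time, end_time, simulation_interval):
    """Filter by site and inclusive time range, then align to the interval."""
    df = pd.read_csv(source, dtype={"sites": str}, parse_dates=["time"])
    df = df[df["sites"] == id]
    df = df[(df["time"] >= start_time) & (df["time"] <= end_time)]
    return df.set_index("time")["value"].resample(simulation_interval).nearest()


obs = load_observations(
    StringIO(CSV_TEXT),
    "01646500",
    datetime(2024, 1, 1, 0),
    datetime(2024, 1, 1, 2),
    pd.Timedelta(3600, unit="s"),
)

# Without dtype={"sites": str}, the id column is inferred as an integer.
inferred_id = pd.read_csv(StringIO(CSV_TEXT))["sites"].iloc[0]
```

With the default dtype inference, `inferred_id` is the integer 1646500, so a comparison against the string "01646500" would never match.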

Likewise, if you would instead like to save the observations used during calibration, you can implement this hook as a "wrapper" style plugin. For example:

# ngen_cal_observation_writer_plugin.py
from __future__ import annotations

import pandas as pd
from ngen.cal import hookimpl

import typing
if typing.TYPE_CHECKING:
    from datetime import datetime

class NgenCalObservationWriterPlugin:
    @hookimpl(wrapper=True)
    def ngen_cal_model_observations(
        self,
        id: str,
        start_time: datetime,
        end_time: datetime,
        simulation_interval: pd.Timedelta,
    ) -> pd.Series | None:
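        # Run the wrapped hook implementation(s) and receive their result.
        ds = yield
        if ds is None:
            return
        # Save the observations, then pass them through unchanged.
        ds.to_csv(f"ngen_cal_observations-{id}.csv")
        return ds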

Related to #93
Related to #111

Additions

ngen_cal_model_observations - called during each calibration iteration to provide truth / observation values in the form of a pandas.Series, indexed by time with a record every simulation_interval.

aaraney · 2024-08-12T19:38:10Z

python/ngen_cal/src/ngen/cal/calibration_set.py

+        simulation_interval: pd.Timedelta = pd.Timedelta(3600, unit="s")
+        obs = self._hooks.ngen_cal_model_observations(
+            id=location.station_id,
+            start_time=start_time,


In reality I think the caller should pass start_time + 1 dt (e.g. 3600s) since the model outputs will not contain values for the actual start_time (left exclusive). ~~I don't think this is a problem now, but just wanted to note it.~~

It does matter and things will break if we don't properly account for this. This only works right now b.c. csv_output's dt is 300s which means the first value is start_time + 5min. When we resample the simulation output to the hour using .resample("h").first() the simulated value at 5min is backfilled to start_time and lines up with the nwis observations.

This likely depends heavily on how the data is "merged" in search.py _objective_function. Currently this is done with a pandas merge with left_index=True, right_index=True, which should result in a data frame of the overlapping indices.
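The alignment behavior described above can be reproduced with synthetic data. This is a sketch under stated assumptions: the series values are made up, `left_exclusive_output` and `merge_on_time` are hypothetical helpers, and the merge mirrors the `left_index=True, right_index=True` call mentioned in the comment rather than the actual code in search.py.

```python
import pandas as pd

start_time = pd.Timestamp("2024-01-01 00:00")
end_time = pd.Timestamp("2024-01-01 03:00")
hour = pd.Timedelta(hours=1)

# Hourly observations covering the inclusive range [start_time, end_time].
obs = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range(start_time, end_time, freq=hour),
    name="obs",
)


def left_exclusive_output(dt):
    """Synthetic model output whose first record is at start_time + dt."""
    index = pd.date_range(start_time + dt, end_time, freq=dt)
    return pd.Series(range(len(index)), index=index, name="sim", dtype="float64")


def merge_on_time(sim, obs):
    """Inner join on the time index, keeping only overlapping timestamps."""
    return pd.merge(
        sim.to_frame(), obs.to_frame(), left_index=True, right_index=True
    )


# dt = 300 s: the 00:05 value lands in the 00:00 hourly bin after resampling.
sim_5min = left_exclusive_output(pd.Timedelta(minutes=5)).resample("h").first()
merged_5min = merge_on_time(sim_5min, obs)

# dt = 3600 s: there is no simulated value for the 00:00 bin at all.
sim_hourly = left_exclusive_output(hour).resample("h").first()
merged_hourly = merge_on_time(sim_hourly, obs)
```

With 5-minute output the merged frame has a row at `start_time` whose simulated value actually comes from 00:05. With hourly output the first merged row is at `start_time + 1h`, so the observation at `start_time` is silently dropped by the inner join.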

hellkite500

A couple things to consider, we can chat about this directly.

hellkite500 · 2024-08-14T14:42:15Z

python/ngen_cal/src/ngen/cal/_hookspec.py

+        The returned pandas Series should be in units of cubic meters per
+        second.
+
+        `id`: USGS gage station id


id doesn't strictly have to be a USGS gage id; more generally it is some identifier linked/referenced to the hydrofabric. We just happen to use gage ids for now. Perhaps this should ultimately be the Hydrofabric feature id instead of the obs id? Or we may need a feature id and a linked id?

Why don't we pass a Nexus feature here which should be able to embody additional context extracted from the hydrofabric which can be used?

Great questions! Glad to hear that we are thinking along the same lines. Two thoughts, prefaced with: I went with this design b.c. I decided to keep it as simple as possible for now. Since we are pre-1.0, I figured we might change this as we learn things along the way.

I'd originally written up a complex object that encapsulates what Mike's group refers to as a point-of-interest. Like you are saying, the thinking being that the id for the model data is likely not the same id for the observation (truth) data. I decided against that to err on the side of simplicity for now. I am open to changing that now or in the future.

I also thought about passing a Nexus, but thinking back to previous conversations we'd had about keeping things model agnostic, it seemed that passing a NextGen-specific idea went against that. If we ever do go in the direction of something like a POI interface as a solution to this problem, I figured we could always create a type that implements the POI interface for a Nexus, for example.

Conversation with @hellkite500: in theory a HydroLocation or Nexus makes the most sense here (really a Nexus b.c. it's more general). We have decided to go with Nexus for now.

I think the Nexus is the right way to go for now; it encapsulates (eventually) a list of hydrolocations which can reference different entities.

Related: NOAA-OWP/hypy#37

python/ngen_cal/src/ngen/cal/ngen_hooks/observations.py

hellkite500

As a follow-up PR, we should create a "user" hook, not in core, which reads a typical USGS csv/tabular file, and create a quick example of how to use it instead of the default hook via config.

aaraney requested a review from hellkite500 August 5, 2024 16:42

aaraney self-assigned this Aug 5, 2024

aaraney added enhancement New feature or request ngen.cal Related to ngen.cal package labels Aug 5, 2024

This was referenced Aug 5, 2024

add option to save observed streamflow data used for calculating performance metrics #93

Open

Improvement: Include options for model evaluation (runoff versus streamflow) #20

Open

aaraney commented Aug 12, 2024

hellkite500 reviewed Aug 14, 2024

aaraney force-pushed the obs-hook branch 2 times, most recently from 2169ed2 to 2d13584 Compare August 15, 2024 18:11

hellkite500 approved these changes Aug 15, 2024

aaraney added 6 commits August 15, 2024 14:21

feat: ngen_cal_model_observation hook

ca18264

feat: ngen_cal_model_observations plugin, NwisObservations

57a8f5f

feat: CalibrationSet gets observations from ngen_cal_model_observations

ab6a83e

refactor: register default ngen plugins in one place

3deac49

feat: register UsgsObservations plugin

50700f0

test: ngen_cal_model_observations registration

230e213

aaraney force-pushed the obs-hook branch from 2d13584 to 230e213 Compare August 15, 2024 18:22

hellkite500 merged commit 22b2a71 into NOAA-OWP:master Aug 15, 2024
12 checks passed

aaraney mentioned this pull request Aug 21, 2024

Port functionality for reading observations from local files #111

Closed
